
fix(metacat): handle unmapped entity tokens and training crashes#399

Merged
mart-r merged 4 commits into CogStack:main from bgriffen:fix/metacat-training-crashes
Apr 7, 2026

Conversation

@bgriffen (Contributor) commented Apr 5, 2026

Summary

Fixes three bugs in MetaCAT supervised training that cause crashes during train_supervised_raw(train_addons=True):

  • Unmapped entity tokens crash inference — entities whose character offsets don't align with any tokenised token cause UnboundLocalError and KeyError during MetaCAT inference within NER training epochs
  • Same crash in training data preparation — identical pattern in _prepare_from_json_loop causes IndexError during train_raw
  • _train_meta_cat missing save_dir_path — train_raw requires a save directory when auto_save_model=True (the default config), but _train_meta_cat never provides one. This crashes after all NER epochs complete, losing hours of training progress.

Reproduction

All three bugs are triggered by calling train_supervised_raw(data, train_addons=True) with MetaCAT addons registered. The token-mapping bugs (1 & 2) occur when entity character spans fall at document boundaries or in whitespace-only regions where no BPE tokens are produced. The save_dir_path bug (3) occurs unconditionally with the default MetaCAT config.

Changes

Commit 1: meta_cat.py — inference path

  • prepare_document: skip entities with empty ctoken_idx (prevents UnboundLocalError on ind)
  • _set_meta_anns: guard ent_id2ind lookup for skipped entities (prevents KeyError)
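
The two guards can be sketched in isolation. A minimal sketch, assuming a HuggingFace-style offset_mapping of (start, end) character pairs; map_entity_to_tokens and prepare_entities are hypothetical stand-ins for the token-mapping loop, not the real prepare_document code:

```python
def map_entity_to_tokens(start, end, offset_mapping):
    """Indices of tokens whose character span overlaps [start, end)."""
    return [i for i, (ts, te) in enumerate(offset_mapping)
            if ts < end and te > start]

def prepare_entities(entities, offset_mapping):
    """Mimics the fixed loop: entities that map to no token are skipped
    instead of crashing with UnboundLocalError / KeyError downstream."""
    prepared = []
    for ent_id, (start, end) in entities:
        ctoken_idx = map_entity_to_tokens(start, end, offset_mapping)
        if not ctoken_idx:   # the guard added by this PR
            continue         # span is past the text or whitespace-only
        prepared.append((ent_id, ctoken_idx[0], ctoken_idx[-1]))
    return prepared
```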

Commit 2: data_utils.py — training data path

  • _prepare_from_json_loop: skip annotations with empty ctoken_idx (prevents IndexError on ctoken_idx[0])

Commit 3: trainer.py — MetaCAT addon training

  • _train_meta_cat: create a temporary directory and pass it as save_dir_path to train_raw when auto_save_model is enabled

Test plan

  • train_supervised_raw(train_addons=True) completes without crash on documents containing entities near text boundaries
  • MetaCAT inference does not crash on entities that fail to map to BPE tokens
  • train_raw receives a valid save_dir_path and saves best checkpoint during training
  • Entities that successfully map to tokens are unaffected (no behaviour change)

bgriffen added 3 commits April 5, 2026 20:06
When an entity's character offsets fall past the end of the tokenised
text (or into a whitespace-only region), the token-mapping loop in
`prepare_document` produces an empty `ctoken_idx` list. This causes:

1. `UnboundLocalError` on `ind` at line 739 — the loop variable is
   never assigned because `offset_mapping[last_ind:]` is empty or no
   pair matches the entity span.

2. `KeyError` in `_set_meta_anns` — skipped entities are absent from
   `ent_id2ind` but the annotation loop does an unconditional lookup.

Fix: skip entities with empty token mappings in `prepare_document`,
and guard the dict lookup in `_set_meta_anns`.
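
The second guard amounts to replacing an unconditional dict lookup with a guarded one. A minimal sketch with made-up data (set_meta_anns here is a simplified stand-in for the real _set_meta_anns):

```python
def set_meta_anns(ent_ids, ent_id2ind, predictions):
    """Attach predictions only to entities that were actually prepared.

    Entities skipped during preparation are absent from ent_id2ind, so
    an unconditional ent_id2ind[ent_id] would raise KeyError.
    """
    meta_anns = {}
    for ent_id in ent_ids:
        ind = ent_id2ind.get(ent_id)  # guarded lookup instead of []
        if ind is None:
            continue  # no token mapping, hence no meta-annotation
        meta_anns[ent_id] = predictions[ind]
    return meta_anns
```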

`_prepare_from_json_loop` in `data_utils.py` has the same token-mapping
pattern as `prepare_document` in `meta_cat.py` (fixed in previous
commit), but in the `train_raw` → `prepare_from_json` code path.

When an annotation's character offsets don't align with any token in
the offset_mapping, `ctoken_idx` is empty and `ctoken_idx[0]` raises
`IndexError`.

Fix: skip annotations with empty token mappings, consistent with the
inference path fix.

`_train_meta_cat` calls `addon.mc.train_raw(data)` without providing
`save_dir_path`. When the MetaCAT config has `auto_save_model=True`
(the default), `train_raw` raises an exception because it needs a
directory to save the best checkpoint during training.

This crash occurs after all NER supervised training epochs have
completed (potentially hours of work), since `_train_addons` is
called at the very end of `train_supervised_raw`.

Fix: create a temporary directory for the MetaCAT save checkpoint
when `auto_save_model` is enabled, and pass it to `train_raw`.

@mart-r (Collaborator) left a comment

Thank you so much for identifying and fixing the issue! We don't get many outside contributions so seeing one is such a treat!

With that said, I've left a few comments on the temporary file handling in MetaCAT training. The previous approach would keep creating new temporary locations as you went through multiple datasets and never clean them up. This could become problematic.

PS:
I would also like the MetaCAT process to be able to have debug output during inference in terms of logging. But since there's currently none, we'll need to look at that separately.

# train_raw requires save_dir_path when auto_save_model is True
save_dir = None
if cnf.train.auto_save_model:
    import tempfile

@mart-r (Collaborator):

Move import to the top of the module. There's no reason for this to be a dynamic import.

@bgriffen (Contributor, author):
Done — moved to module-level imports in 5796e13.

save_dir = None
if cnf.train.auto_save_model:
    import tempfile
    save_dir = tempfile.mkdtemp(

@mart-r (Collaborator):
This method creates a persistent temporary data path that would need to be cleared when done using it.

I would prefer a context manager. We can probably just always do that: when the save directory isn't used, nothing is written to it. The overhead of creating (and removing) this folder even when it's not used at training time is going to be negligible (it's a one-time operation per MetaCAT per dataset).

So I would prefer something along the lines of:

with tempfile.TemporaryDirectory() as save_dir:
    # NOTE: this is a mypy quirk - the types are compatible
    addon.mc.train_raw(cast(dict, data), save_dir_path=save_dir)


@bgriffen (Contributor, author):
Good call — switched to TemporaryDirectory context manager in 5796e13. The best weights are loaded back into memory before train_raw returns, so the directory can safely be cleaned up on exit.

Address review feedback: move tempfile import to module top and use
TemporaryDirectory context manager instead of mkdtemp to avoid leaking
orphaned temp directories across multiple training runs.
@mart-r mart-r merged commit 61d3bb0 into CogStack:main Apr 7, 2026
19 of 20 checks passed